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O ! Abstract 

o . 

{SI . In this paper we consider two closely related problems : estimation of eigenvalues and eigen- 

functions of the covariance kernel of functional data based on (possibly) irregular measurements, 
C3 ■ and the problem of estimating the eigenvalues and eigenvectors of the covariance matrix for 

^) | high-dimensional Gaussian vectors. In Peng and Paul (2007), a restricted maximum likelihood 

(REML) approach has been developed to deal with the first problem. In this paper, we estab- 
' lish consistency and derive rate of convergence of the REML estimator for the functional data 

■ case, under appropriate smoothness conditions. Moreover, we prove that when the number of 

£-H \ measurements per sample curve is bounded, under squared-error loss, the rate of convergence of 

• the REML estimators of eigenfunctions is near-optimal. In the case of Gaussian vectors, asymp- 

totic consistency and an efficient score representation of the estimators are obtained under the 
assumption that the effective dimension grows at a rate slower than the sample size. These 
results are derived through an explicit utilization of the intrinsic geometry of the parameter 
space, which is non-Euclidean. Moreover, the results derived in this paper suggest an asymp- 
totic equivalence between the inference on functional data with dense measurements and that 
— . of the high dimensional Gaussian vectors. 

> • 

in : 

1 Introduction 

Analysis of functional data, where the measurements per subject, or replicate, are taken on a 
finite interval, has been one of the growing branches of statistics in recent times. In fields such 
as longitudinal data analysis, chemometrics, econometrics, the functional data analysis viewpoint 
has been successfully used to summarize data and gain better understanding of the problems 
t> ■ at hand. The monographs of Ramsay and Silverman (2005) and Ferraty and Vieu (2006) give 
detailed accounts of the applications of functional data approach to various problems in these 
fields. Depending on how the individual curves are measured, one can think of two different 
scenarios - (i) when the curves are measured on a dense grid; and (ii) when the measurements are 
observed on an irregular, and typically sparse set of points on an interval. The first situation usually 
arises when the data are recorded by some automated instrument, e.g. in chemometrics, where the 
curves represent the spectra of certain chemical substances. The second scenario is more typical 
in longitudinal studies where the individual curves could represent the level of concentration of 
some substance, and the measurements on the subjects may be taken only at irregular time points. 
In the first scenario, i.e., data on a regular grid, as long as the individual curves are smooth, the 
measurement noise level is low, and the grid is dense enough, one can essentially treat the data 
to be on a continuum, and employ techniques similar to the ones used in classical multivariate 
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analysis. For example, Hall and Hosseini-Nasab (2006) derive stochastic expansions of sample 
PCA when the sample curves are noise-free and measured on a continuum. However, in the second 
scenario, the irregular nature of the data, and the presence of measurement noise pose challenges 
and require a different treatment. Under such a scenario, data corresponding to individual subjects 
can be viewed as partially observed, and noise-corrupted, independent realizations of an underlying 
stochastic process. The estimation of the eigenvalues and eigenfunctions of a smooth covariance 
kernel, from sparse, irregular measurements, has been studied by various authors including James, 
Hastie and Sugar (2000), Yao, Miiller and Wang (2005), and Peng and Paul (2007), among others. 

In Peng and Paul (2007), a restricted maximum likelihood (REML) approach is taken to obtain 
the estimators. REML estimators are widely used and studied in statistics. For example, the 
usefulness of REML and profile REML estimation has been recently demonstrated in the context 
of functional linear mixed effects model by Antoniadis and Sapatinas (2007). In Peng and Paul 
(2007), it is assumed that the covariance kernel can be well-approximated by a positive-semidefinite 
kernel of finite rank r whose eigenfunctions can be represented by M(> r) known orthonormal basis 
functions. Thus the basis coefficient matrix B of the approximant belongs to the Stiefel manifold 
of M X r matrices with orthonormal columns. The working assumption of Gaussianity allows the 
authors to derive the log- likelihood of the observed data given the measurement times. Then a 
Newton-Raphson procedure, that respects the geometry of the parameter space, is employed to 
obtain the estimates by maximizing the log-likelihood. This procedure is based on the formulation 
of a general Newton-Raphson scheme on Stiefel manifold developed in Edelman, Arias and Smith 
(1998). Peng and Paul (2007) also derive a computationally efficient approximate cross-validation 
score for selecting M and r. Through extensive simulation studies, it is demonstrated that the 
REML estimator is much more efficient than an alternative procedure (Yao et al, 2005) based 
on local linear smoothing of empirical covariances. The latter estimator does not naturally reside 
in the parameter space, even though it has been proved to achieve the optimal non-parametric 
convergence rate in the minimax sense under I 2 loss, under the optimal choice of the bandwidth 
and when the number of measurements per curve is bounded (Hall, Miiller and Wang, 2006). Also, 
in most situations, our method outperforms the EM approach of James et al. (2000). Although 
the latter estimator also aims to maximize the log-likelihood, it does not naturally reside in the 
parameter space either, and thus it does not utilize its geometry efficiently. 

The superior numerical performance of the REML estimator motivates us to conduct a detailed 
study of its asymptotic properties. In this paper, we establish consistency and derive the rate of 
convergence (under I 2 loss) of the REML estimator when the eigenfunctions have a certain degree 
of smoothness, and when a stable and smooth basis, e.g., the cubic B-spline basis with a pre- 
determined set of knots, is used for approximating them. The techniques used to prove consistency 
differ from the standard asymptotic analysis tools when the parameter space is Euclidean. Specif- 
ically, we restrict our attention to small ellipsoids around zero in the tangent space to establish a 
mathematically manageable neighborhood around an "optimal parameter" (a good approximation 
of the "true parameter" within the model space). We derive asymptotic results when the number 
of measurements per curve grows sufficiently slowly with the sample size (referred as the sparse 
case). We also show that for a special scenario of the sparse case, when there is a bounded number 
of measurements per curve, the risk of the REML estimator (measured in squared-error loss) of 
the eigenfunctions has asymptotically near-optimal rate (i.e., within a factor of logn of the optimal 
rate) under an appropriate choice of the number of basis functions. 

Besides the sparse case, we consider two other closely related problems: (i) the estimation of 
the eigenvalues and eigenfunctions of a smooth covariance kernel, from dense, possibly irregular, 
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measurements (referred as the dense case); and (ii) the estimation of the eigenvalues and eigen- 
vectors of a high-dimensional covariance matrix (referred as the matrix case). In the matrix case, 
we assume that there is preliminary information so that the data can be efficiently approximated 
in a lower dimensional known linear space whose effective dimension grows at a rate slower than 
the sample size n. The proofs of the results in all three cases utilize the intrinsic geometry of the 
parameter space through a decomposition of the Kullback-Leibler divergence. However, the matrix 
case and the dense case are more closely related, and the techniques for proving the results in these 
cases are different in certain aspects from the treatment of the sparse case, as described in Sections 
[2]and[3l 

Moreover, in the matrix case, we also derive a semiparametric efficient score representation 
of the REML estimator (Theorem 4), that is given in terms of the intrinsic Fisher information 
operator (note that the residual term is not necessarily op(n -1 / 2 )). This result is new, and explicitly 
quantifies the role of the intrinsic geometry of the parameter space on the asymptotic behavior of 
the estimators. Subsequently, it points to an asymptotic optimality of the REML estimators. 
Here, asymptotic optimality means achieving the asymptotic minimax risk under I 2 loss within a 
suitable class of covariance matrices (kernels). We want to point out that, in the matrix case, the 
REML estimators coincide with the usual PC A estimates, i.e., the eigenvalues and eigenvectors 
of the sample covariance matrix (Muirhead, 1982). In Paul and Johnstone (2007), a first order 
approximation of the PCA estimators is obtained by matrix perturbation analysis. Our current 
results show that the efficient score representation coincides with this approximation, and thereby 
gives a geometric interpretation to this. The asymptotically optimal rate of the £ 2 -risk of the REML 
estimator in the matrix case follows from this representation and the lower bound on the minimax 
rate obtained in Paul and Johnstone (2007). Asymptotic properties of high-dimensional PCA under 
a similar context have also been studied by Fan, Fan and Lv (2007). Recently several approaches 
have been proposed for estimating large dimensional covariance matrices and their eigenvalues and 
eigenvectors under suitable sparsity assumptions on the population covariance, e.g. Bickel and 
Levina (2007, 2008) and El Karoui (2008). 

At this point, we would like to highlight the main contributions of this paper. First, we have 
established the consistency and derived the rate of convergence of REML estimators for functional 
principal components in two different regimes - the sparse case and the dense case. In Hall et al. 
(2006), it is shown that an estimator of functional principal component based on a local polynomial 
approach achieves the optimal nonparametric rate when the number of measurements per curve 
is bounded. However, to the best of our knowledge, no results exist regarding the consistency, or 
rate of convergence, of the REML estimators in the functional data context. Secondly, we have 
derived an efficient score representation for sample principal components of high-dimensional, i.i.d. 
Gaussian vectors. This involves calculation of the intrinsic Fisher information operator and its 
inverse, and along the line we also provide an independent verification that the REML estimates 
under a rank-restricted covariance model are indeed the PCA estimates. Thirdly, we expect that the 
current framework can be refined to establish efficient score representation of the REML estimators 
of the functional principal components, and therefore the results obtained in this paper serve as first 
steps towards studying the asymptotic optimality of these estimators. Moreover, results obtained 
in this paper suggest an asymptotic equivalence between the inference on functional data with 
dense measurements and that of the high dimensional Gaussian vectors. Finally, our work provides 
useful techniques for dealing with the analysis of estimation procedures based on minimization 
of a loss function (e.g. MLE, or more generally M-estimators) over a non-Euclidean parameter 
space for semiparametric problems. There has been some work on analysis of maximum likelihood 
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estimators for parametric problems when the parameter space is non-Euclidean (see e.g. Oiler and 
Corcuera, 1995). However, there has been very limited work for non/semi-parametric problems 
with non-Euclidean parameter space. Recently, Chen and Bickel (2006) establish semiparametric 
efficiency of estimators in ICA (Independent Component Analysis) problems using a sieve maximum 
likelihood approach. 

The rest of the paper is organized as follows. In Section [21 we present the data model for 
the functional principal components, and state the consistency results of the REML estimators. 
In Section [3l we describe the model for high-dimensional Gaussian vectors and derive asymptotic 
consistency and an efficient score representation of the corresponding REML estimators. Section 
H] is devoted to giving an overview of the proof of the consistency result for the functional data 
case (Theorems 1 and 2). Section [5] gives an outline of the proof of consistency in the matrix case 
(Theorem 3), in particular emphasizing the major differences with the proof of Theorem 1. Section 
[6] is concerned with the proof of the score representation in the matrix case (Theorem 4). Section 
[7] has a summary of the results and a discussion on some future works. Technical details are given 
in the appendices. 

2 Functional data 

In this section, we start with a description of the functional principal components analysis, and then 
make a distinction between the sparse case and the dense case. We then present the asymptotic 
results and relevant conditions for consistency under these two settings. 

2.1 Model 

Suppose that we observe data Y{ = {Yijf^lii a * the design points Tj = (Tj J )J!! 1 , i = 1, . . . , n, with 

Y ij =X i (T ij ) + ae ij , (1) 

where {sij} are i.i.d. N(0, 1), Aj(-) are i.i.d. Gaussian processes on the interval [0, 1] (or, more 
generally, [a, b] for some a < b) with mean and covariance kernel T,q(u,v) = K[Xi(u)Xi(v)]. £o 
has the spectral decomposition 

oo 
k=l 

where {V'fcl^Li are orthonormal eigenfunctions and Ai > • • • > A r > A r +i > • • • > are the 
eigenvalues. The assumption that the stochastic process has mean zero is simply to focus only on 
the asymptotics of the estimates of eigenvalues and eigenfunctions of the covariance kernel (i.e., the 
functional principal components). 

Throughout this paper we assume Gaussianity of the observations. We want to emphasize that, 
Gaussianity is more of a working assumption in deriving the REML estimators. But it plays a less 
significant role in asymptotic analysis. For the functional data case, only place where Gaussianity 
is used is in the proof of Proposition 3, and even this can be relaxed by assuming appropriate 
tail behavior of the observations. Gaussianity is more crucial in the analysis for the matrix case. 
The proofs of Proposition 6 and Theorem 4 depend on an exponential inequality on the extreme 
eigenvalues of a Wishart matrix (based on a result of Davidson and Szarek (2001)), even though 
we expect the non-asymptotic bound to hold more generally. 
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In this paper, we are primarily interested in the situation where the design points are i.i.d. 
from a distribution with density g (random design). We shall consider two scenarios, to be referred 
as the sparse case and the dense case, respectively. The sparse case refers to the situation when 
the number of measurements, nii, are comparatively small (see Bl). The dense case refers to the 
situation where the mi's are large so that the design matrix (i.e., the matrix of basis functions 
evaluated at the time points) has a concentration property (see Bl' and D). In the latter case, we 
also allow for the possibility that the design is non-random. 

Next, we describe the model space, to be denoted by M.m,t '■= -M-M,r{4>)i (for 1 < r < M) for 
the REML estimation procedure. The model space Mm,t consists of the class of covariance kernels 
C(-, •), which have rank r, and whose eigenfunctions are represented in a known orthonormal basis 
{4>k}^f=i °f smooth functions. Furthermore, the nonzero eigenvalues are all distinct. For example, 
in Peng and Paul (2007), {<f>k)k=i ls taken to be an orthonormalized cubic £?-spline basis with 
equally spaced knots. Thus, the model space consists of the elements C(-, •) = Y^k=x ^fcV'fcOV'fcO); 
where Ai > • • • > A r > 0, and (if)i(-), . . . , ip r (')) = {4 > {')) T where B is an M x r matrix satisfying 
B T B = I r , and </>(•) = (^i(-), . . . ,4>m{')) T ■ Note that we do not assume that So belongs to the 
model space. For the asymptotic analysis, we only assume that it can be well-approximated by a 
member of the model space (see condition C and Lemma 1). We define the best approximation 
error of the model as infp: . . , II So — C lip, where || • \\p denotes the Hilbert-Schmidt norm. 

C&M M ,r\<P) 11 " 11 " 

A rank r approximation to So in M.M,r{<t>) can be defined as 

r 

k=l 

with A*! > • • • > A* r > 0, and 

(*),••• ,V*r(*)) = ((f>(t)) T B*, 

where B* is an M x r matrix satisfying B^B* = I r . We refer to {(ip*k, A*fc)}£ =1 , or equivalently, 
the pair (£?*, A*), as an optimal parameter, if the corresponding S^o is a close approximation to So 
in the sense that, the approximation error || So — S*o \\f has the same rate (as a function of M) as 
the best approximation error. Henceforth, (B*, A*) is used to denote an optimal parameter. 

Observe that, under model (fT|), Yi are independent, and conditionally on they are distributed 
as iV mi (0,Si). Here, the rrii x m, matrix Sj is of the form Sj = ((S (Ty, Tjj')))™j, =1 + a 2 I mr 
Then the matrix S*j = <frf B^A^B^^i + o- 2 I mi is an approximation to Sj, where $i := [4>{Tn) : 
• • • : 4>(Ti mi )] is an M x matrix. We shall use A to denote interchangeably the r x r diagonal 
matrix diag(Ai, . . . , A r ) and the r x 1 vector (Ai, . . . , A r ) T . Note that, the parameter (B, A) belongs 
to the parameter space ft ■= S M ,r ® M+, where S M ,r = {A G R Mxr : A T A = I r } is the Stiefel 
manifold of M x r matrices with orthonormal columns. For fixed r and M, the REML estimator 
of {(V'fc) Afc)}£ =1 is defined as a minimizer over $7 of the negative log-likelihood (up to an additive 
constant and the scale factor n): 

1 n 1 n 

L n {B, A) = — J> (S-^.yf ) + — ^ log |S,|, (2) 

i=l i=l 

where Si = $jBAB T <S> t + a 2 I mi . 
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2.2 Consistency 



We shall present results on consistency of the REML estimators of functional principal components 
in the two different regimes considered above, namely, the sparse case (i.e., when the number of 
measurements per curve is "small") and the dense case (i.e., when the number of measurements per 
curve is "large"). Throughout this paper, we assume that a 2 is known, even though Peng and Paul 
(2007) provide estimate of a 2 as well. This assumption is primarily to simplify the exposition. It 
can be verified that all the consistency results derived in this paper hold even when a 2 is estimated. 
We make the following assumptions about the covariance kernel So- 

Al The r largest eigenvalues of So satisfy, (i) c\ > Ai > • • • > A r > A r+ i for some c\ < oo; (ii) 
maxi<j< r (Aj — Aj + i) _1 < C2 < oo. 

A2 The eigenfunctions {^&}fc =1 are four times continuously differentiable and satisfy 

— (4) 

max ibl oo< Cn for some < Cn < oo. 

KfcO 



SPARSE case. In this case, we only consider the situation when a 2 is fixed (i.e., it does not vary 
with n). We shall first deal with the case when m^s are bounded. Then we extend our results to 
the situation when m^'s increase slowly with sample size, and are of the same order of magnitude 
for all i (condition Bl). We also assume a boundedness condition for the random design (condition 
B2). 

Bl The number of measurements rrii satisfy m < mi <rn with 4 < m and m/m is bounded by 
some constant d,2 > 0. Also, m = 0(n K ) for some k > 0. 

B2 For each i, {Tij : j = 1, . . . , mi} are i.i.d. from a distribution with density g, where g satisfies 
Cg,o < 9( x ) < Cg.i for all x £ [0, 1], where < c 5i o < c 9t \ < oo. (3) 

Finally, we have a condition on the I 2 error for approximating the covariance kernel in the model 
space Mm,t- Define the maximal approximation error for an optimal parameter A*) as: 

I3 n := max — || - S ri \\ F . (4) 

l<i<n n%i 

C W0 n = O(^M^). 

If we use orthonormalized cubic i?-spline basis for representing the eigenfunctions, then C follows 
from A1-A2 and B1-B2, if the covariance kernel is indeed of rank r: 

Lemma 1 : If A1-A2 and B1-B2 hold, So is of rank r, and we use the orthonormalized 
cubic B- spline basis with equally spaced knots to represent the eigenfunctions, then C holds, if 
M- X (nm 2 l log n)V9 = q(1). 

Proof of Lemma 1 follows from the fact that for a cubic B-spline basis, for sufficiently large M, we 
can choose (£>*, A*) such that (i) max!<fc< r || ip k — ijj^ ||oo= 0(M~ 4 ) (by Al and A2), and (ii) 
f3 n = 0{M ) (see Appendix A). This implies that || So — S^o ||f= 0(M -4 ). The assumption that 
the covariance kernel is of finite rank can be relaxed somewhat by considering the true parameter 
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as a sequence of covariance kernels S 0>n such that the (r + l)-th largest eigenvalue A r+ i jn decays to 
zero sufficiently fast. Note that in Lemma 1, the use of B-spline basis is not essential. The result 
holds under the choice of any stable basis (i.e., the Gram matrix has a bounded condition number) 
with sufficient smoothness. 

We now state the main result in the following theorem. 

Theorem 1 (sparse case): Suppose that A1-A2, B1-B2 and C hold, andrri is bounded. Suppose 
further that M satisfies 

M — ► oo, such that M~ l {n/ log n) 1 / 9 = 0(1) and M = o(\/n/ log n) , as n — > oo. (5) 

Then, given r] > 0, there exists cq^ > such that for a n = co^ay ^ 1 M ^ osn , with probability at 
least 1 — 0(n~ v ), there is a minimizer (B,A) of |2|) satisfying 

|| B — B* \\p < a n , 
II A — A* || jr < a n . 

Moreover, the corresponding estimate of the covariance kernel, viz., T>q(u, v) = X)fc=i ^ki^k(u)^k(v) , 
satisfies, with probability at least 1 — 0(n~ ri ), 

II S - So \\f= 0{a n ). 

Corollary 1: Suppose that the conditions of Theorem 1 hold. Then the best rate of convergence 
holds if M x (n/logn) 1 / 9 , and the corresponding rate is given by a n x (log n/n) 4 / 9 . For estimating 
the eigenf unctions, this is within a factor of log n of the optimal rate. The optimal rate over a 
class C of covariance kernels of rank r satisfying conditions A1-A2, and the random design points 
satisfying conditions B1-B2 (within bounded), is n -4 / 9 . 

Notice that, the rate obtained here for the estimated eigenvalues is not optimal. We expect a 
parametric rate of convergence for the latter, which can be achieved by establishing an efficient 
score representation of the estimators along the line of Theorem 4. The following result generalizes 
Theorem 1 by allowing for mi's to slowly increase with n, and its proof is encapsulated in the proof 
of Theorem 1. 

Corollary 2: Suppose that, A1-A2, B1-B2 and C hold. Suppose further that, m and M satisfy 

(i) m 4 Mlogn = o(n), (ii) max{m 3 M 5/2 (log?i) 2 , m 7/2 M 2 (logn) 3/2 } = o(n), 
(in) M~ l (rim 2 / log n) 1/9 = 0(1), (iv) m 2 M 2 \ogn = o(n). (6) 

Then the conclusion of Theorem 1 holds. Also, the best rate is obtained when M x (rim 2 / log re) 1 / 9 , 
and the corresponding a n x m 10 / 9 (log n/n) 4//9 . 

Condition (i) is required to ensure that m 2 a 2 = o(l); condition (ii) is needed to ensure that the 
upper bound in (|4U|) in Lemma 4 is o(l); condition (iii) ensures that C holds; and finally, condition 
(iv) is used in proving Lemmas 4, 5 and 6 in Appendix B. A sufficient condition for ([6]) to hold is 
that m = 0(n 1 /^) and M x (rem 2 / log n) 1 / 9 . Notice that the best rate obtained in Corollary 2 is 
not optimal in general. It is near-optimal (up to a factor of log n of the optimal rate) only when m 
is bounded above (Theorem 1). 
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DENSE case. This case refers to the scenario where the number of time points per curve is large, 
such that mini<j< n , rrii — > oo sufficiently fast (see condition D and the corresponding discussion). 
For simplicity, we assume further that the number of design points is the same for all the sample 
curves, which is not essential for the validity of the results. Denote this common value by m. 
In terms of the asymptotic analysis, there is an important distinction between the sparse case 
and dense case. For the purpose of further exposition and the proof of the result on consistency 
of REML estimator in the dense case, it is more convenient to work with the transformed data 
% = $iYi. Let Ti = i^S^f and R { = ^<&i<$>J ■ Then = mRiBAB T Ri + a 2 R { . Then, a way 
of estimating {(A^, ^ fc )}]* =1 is by minimizing the negative log-likelihood of the transformed data: 

1 - 1 ~ ~ 1 n 

i=l i=l 

Notice that, if R^s are non-singular for all i, then by direct computation, we have that the negative 
log-likelihoods for the raw data: ([2]) and that of the transformed data: ([7]) differ only by a constant 
independent of the parameters B and A. Hence, on the set {Ri are non-singular for all i}, the 
estimators obtained by minimizing ([2]) and ([7]) are the same. Assumptions Bl and B2 are now 
replaced by: 

Bl' m = 0{n K ) for some k > 0. 
D Given rj > 0, there exist constants ci tV , c%^ > such that 



max II Ri — Im ||< c\ „W — r , max || B^RiB^ — I r ||< C2 V — r I - 1 — Ofn v 

i<i<n y mlogn i<i<n m log n J 



(8) 

Denote the event described in ([8]) by Ai iV . Note that Ai tV is defined in terms of T := {Tij : 
j = 1, . . . , m; i = 1, . . . , n} alone. We assume throughout that a 2 < m (note that a 2 /m can be 
viewed as the signal-to-noise ratio. Therefore, for n large enough, on Ai }T) , Ri is invertible for 
all i. The condition D gives concentration of individual Ri's around the identity matrix and is 
discussed in more detail at the end of this section. Finally, we make an assumption about the 
maximal approximation error (3 n defined through @ which differs slightly from the condition C in 
the sparse case. 



C Given r] > 0, there is a constant c v > such that j3 n < c^^y — *° g - with probability at least 
1 - 0(n-"). 

A result similar to Lemma 1 can be proved to ensure condition C when a stable basis is used. 
Theorem 2 (dense case): Suppose that A1-A2, Bl', C and D hold, and m > a 2 > 0. Then, 
given rj > 0, there exists Co, v > such that for a n = co^y ^^m w ^ n probability at least 
1 — 0(n _?? ), there is a minimizer (B,A) of ^ satisfying 

\\ (I M - B,Bj)(B - B,) \\ F < a n , 
|| Bj(B - B*) \\ F < 



Tfl 

A - A* \\ F < \l—zOL r 
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Further, the corresponding estimated covariance kernel Y^q(u,v) = X^fc=i ^&V'k( w )V'fc(' y ) satisfies, 
with probability at least 1 — 0(n _?? ), 

l|£o _^o(^). 

The proof of Theorem 2 requires a slight refinement of the techniques used in proving Theorem 3 
stated in Section [3.21 making heavy use of condition D. To save space, we omit the proof. Note 
that the best rate in Theorem 2 implicitly depends on conditions C and D in a complicated way, 
which is not the optimal rate for estimating the principal components. The optimal rate for l<i risk 
of the eigenfunctions in this context is conjectured to be of the order max{(cr 2 /nm) 8 / 9 , (1/n)} with 
the second term within brackets appearing only when r > 1. This can be verified for the case r = 1 
with a refinement of the proof of Corollary 1. 

Discussion on condition D: We shall only consider the setting of an uniform design - either fixed, 
or random. The condition ([8]) clearly requires m to be sufficiently large, since it gives concentration 
of individual R^s around the identity matrix. To fulfil D, we also need some conditions on the 
basis functions used. Specifically, we concentrate on the following classes of basis functions. We 
assume that the basis functions are at least 3 times continuously differentiable. 

El (Sinusoidal basis) m&XKk<M II 0fe ||oo = 0(1). 

E2 (Spline-type basis) (i) For any k £ {1, . . . , M}, at most for a bounded number of basis func- 
tions <j)[, supp(0 fc )n supp(^) is nonempty; (ii) maxi< fc < M || 4> k \\ 00 = 0(VM). 

One of the key observations in the case of functional data is that, the eigenfunctions {^*fc}£ =1 of the 
kernel So* (belonging to the model space) have the same degree of smoothness as the basis {4>k\k=\^ 
and the functions {ip^k}k=i an( ^ their derivatives are bounded. Also, notice that, B^RiB^ = 
((m Sj=i i ; *k(Tij)4>*i(Tij))) r k i =1 . Based on these observations, we present some sufficient conditions 
for ([8j) to hold under the uniform design and bases of type El or E2. We omit the proof, which 
uses Bernstein's inequality (in the random design case) and the Trapezoidal rule (in the fixed design 
case) . 

Proposition 1: Suppose that the basis is of type El or E2. In the case of random, uniform design, 
(0) is satisfied if (Mlogn) 2 jo 2 = 0(1), and \fra\ogn j o 2 = O(l). In the case of fixed, uniform 
design, holds (with probability 1) if M ^ gn (l + ^f/o- 2 = 0(1), and \ogn/a 2 = O(l). 
Moreover, in this setting, if the eigenfunctions {ipk}k=i vanish o,t the boundaries, and if the basis 
functions are chosen so that they also vanish at the boundaries, it is sufficient that M ?°f n /o~ 2 = 
O(l) and ^/a 2 = 0(l). 

Note that, two obvious implications of Proposition 1 are that (i) m needs to be rather large; and 
(ii) a 2 may need to grow with n, in order that D holds. 

Remark 1: It is to be noted that even though the consistency results for the functional data 
problem are proved under a specific choice of the basis for representing the eigenfunctions, viz., the 
(orthonormalized) cubic B-spline basis with equally spaced knots, this is by no means essential. 
The main features of this basis are given in terms of the various properties described in Appendix 
A. The crucial aspects are: (a) the basis is stable; (b) the basis functions have a certain order of 
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smoothness; and (c) the basis functions have fast decay away from an interval of length 0(M _1 ) 
where M is the number of basis functions used. Same consistency results can be proved as long as 
those properties are satisfied. 

Remark 2: When to, the number of measurements is bounded, we can relax condition A2 to that 
the eigenfunctions are twice continuously differentiable and with bounded second derivative, and 
under this assumption we can prove a result analogous to Theorem 1 and Corollary 1, with the 
corresponding optimal rate of convergence being n -2 / 5 instead of n _4//9 . 

3 High-dimensional vector 

In this section, we describe a scenario where the observations are i.i.d. Gaussian vectors, which 
can be approximately represented in a known lower dimensional space (see C"), where the effective 
dimensionality of the observations grows at a rate slower than the sample size. For convenience, we 
refer this setting as the matrix case. It can be seen that besides the proofs of the results derived 
in this section sharing a lot of common features with those in Section these results also suggest 
an asymptotic equivalence between the dense case for functional data, and the matrix case. This 
means that understanding one problem helps in understanding the other problem. In particular, we 
conjecture that the results derived for the Gaussian vectors, such as the efficient score representation 
(Theorem 4), can be carried over to the functional data case with dense measurements. 

3.1 Model 

Suppose that we have i.i.d. observations Y±, ■ ■ ■ ,Y n from N m (0, £). Assume the covariance matrix 
£ has the following structure 



This may be regarded gnal-plus-noise" model, with a 2 representing the variance of the 

isotropic noise component. We further assume that So has at least r positive eigenvalues, for some 
r > 1. The eigenvalues of T,q are given by sAi > • • • > s\ r > s\ r+ i >•••>(), where s > is 
a parameter representing the "signal strength" (so that s/a 2 represents the signal-to- noise ratio). 
We assume that the observations can be well represented in a known M dimensional basis $ with 
M < to (condition C"). Then the model space Mm,v(^) (with r < M < to) is defined as the set of 
all to x to matrices £ of the form £ = s& T BAB T <& + a 2 I m , where $ is an M x m matrix satisfying 
= Im, B G Sm,t and A is r x r, diagonal with positive diagonal elements. Note that, in order 
to prove consistency of the REML estimator, we require that the intrinsic dimension M grows with 
n sufficiently slowly. In fact, it has been shown (e.g. in Paul (2005)) that, when s/a 2 = 0(1), M 
must be o(n) to achieve consistency. 

Throughout we assume that a 2 and s are known. Of course, we can estimate the eigenvalues of 
£ without any knowledge of s. The unknown parameters of the model are B and A. The parameter 
space is therefore f2 = Sm,t ® W±- The estimate (B, A) of (B, A) is obtained by minimizing over Q 
the negative log-likelihood (up to an additive constant and the multiplicative factor n), 



S = S + a I m . 




(9) 



i=l 



We then set the estimator of the first r eigenvectors of S as ^ = $ B. 
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Similar to the dense case, for asymptotic analysis, it is more convenient to work with the 
transformed data = Let Y = $S$ T = sBKB T + a 2 I M . Then one can obtain estimates of 
{B, A) by minimizing over f2 the negative log-likelihood of the transformed data: 

i n „ ~ 1 

L n (J3, A) = — tr (r- 1 £ ^if ) + - log |r|, (10) 

i=l 

which results in the same estimate obtained by minimizing ©. 

Remark 3: It is known that (Muirhead, 1982), in the setting described above, the REML esti- 
mators of (B, A) coincide with the first r principal components of the sample covariance matrix 
of Yi = &Yi, i = l,...,n. On the other hand, based on the calculations carried out in Ap- 
pendix D, it is easy to see that the PCA estimators (B PC , A PC ) satisfy the likelihood equations 
V BL n (B PC , A PC ) = and V^L n (B PC , A PC ) = 0. Thus, our approach provides an independent 
verification of the known result that the PCA estimates are REML estimators under the rank- 
restricted covariance model studied here. 

3.2 Consistency 

We make the following assumptions about the covariance matrix. 

Al' The eigenvalues of So are given by sX\ > ■ ■ ■ > s\ m > and satisfy, for some r > 1 (fixed), 
(i) ci > Ai > • • • > A r > A r+ i for some c\ < oo; (ii) maxi<j< r (Aj — A J+ i) _1 < C2 < oo. 

C" Assume that there exists (B*, A*) G (referred as "optimal parameter") such that, the matrix 
S^o = s$> T B*A*Bj'$> is a close approximation to So in the sense that f3 n :=|| So — S^o ||f= 

Note that C" implies that the observation vectors can be closely approximated in the basis <3?. 

Theorem 3 (matrix case): Suppose that Al' and C" hold, and s > a 1 > 0. Then given n > 0, 
there exists cq^ > such that for a n = c 0,riO~\J~- n °f n , with probability at least 1 — 0(n~ ri ), there is 
a minimizer (B, A) of il(J\) satisfying 

|| (I M -B*B?)(B- 

II Bj(B - 

II A 

Observe that the rates obtained in Theorem 2 and Theorem 3 are identical once we replace m 
in Theorem 2 by s. Thus the number of measurements m in the dense case is an analog of the 
signal strength s in the matrix case. This important observation suggests an asymptotic equivalence 
between these two problems. This is a result of the concentration of the matrices {Ri}f =1 around 
Im for the dense case (condition D). Under the matrix case, the analogs of Ri exactly equal the 
identity matrix. Moreover, Theorem 3 establishes the closeness of the REML estimator to the 
optimal parameter, which serves as an important step towards proving Theorem 4. 



-B*) \\f < ct n , 



A* 



< 
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3.3 Efficient score representation 

When the observations are i.i.d. Gaussian vectors, we can get a more refined result than the one 
stated in Theorem 3. In this section, we show that by using the intrinsic geometry, we can get 
an efficient score representation of the REML estimator (and hence PC A estimator). In Paul and 
Johnstone (2007), a first order approximation to the sample eigenvectors (i.e. PCA estimates) is 
obtained using matrix perturbation theory (Kato, 1980). Subsequently, it has also been shown there 
that the rate of convergence of l 2 -risk of PCA estimators is optimal. Here, we show that the efficient 
score representation of the REML estimator coincides with this first order approximation when the 
signal-to-noise ratio s/a 2 is bounded (Corollary 3). Our approach is different from the perturbation 
analysis. It also quantifies the role of intrinsic geometry of the parameter space explicitly. Our result 
gives an alternative interpretation of this approximation, and consequently, the score representation 
points to an asymptotic optimality of the REML (and hence PCA) estimator. 

We first introduce some notations. More details can be found in Appendix D. Let £ = log A 
(treated interchangeably as an r x 1 vector and anrxr diagonal matrix) . The the parameter space 
for (B,Q is fl := S M ,r © K r . Let T B := {U G R Mxr : B T U = -U T B} denote the tangent space 
of the Stiefel manifold Sm,t at B. Then the tangent space for the product manifold £1 at (B, £) is 
Tb © W (see Appendix E for the definition of the product manifold and its tangent space). 

For notational simplicity, we use 9* to denote (!?*,£*) and 9q to denote (So, Co)- Define 
L(6o;0*) = Kg t L n (9o). Let VL n (-) and VL(-;0*) denote the intrinsic gradient of the functions 
L n (-) and £(•;#*) with respect to (-£>,£), respectively. Also, let H n {-) and H(-;9*) denote the 
intrinsic Hessian operator of the functions L n {-) and !/(•;#*) with respect to (-£>,£), respectively. 
Let if _1 (-; #*) denote the inverse Hessian operator of L(-;0*). Also we use i?s(-;^*) to denote the 
Hessian of L(-; 9*) w.r.t. B. Notations for Hessian w.r.t. £ and gradients w.r.t B and £ are defined 
similarly. 

The following result gives the efficient score representation of the REML estimator in the situ- 
ation when a 2 = 1 and s = 1. The result can be extended via rescaling to the case for arbitrary a 2 
and s with s > a 2 > 0, and s/a 2 being bounded. 

Theorem 4 (score representation): Suppose that Al' and C" hold with a 2 = 1, s = 1, and 

M = o(n a ) for some a G (0, 1). Let 7 n = max{ \J~ ^ IV ^ S " , n } ■ Then there is a minimizer (B,A) of 
the negative log-likelihood U0\) such that, with probability tending towards 1, 

B-B* = -H B 1 (9,;9,)(V B L n (9,))+0(^) (11) 
A -A* = -A^ c - 1 (^;^)(V ? L n (^)) + 0( 7 2). (12) 

In particular, from this representation, we have, with probability tending towards 1, 

\\B-B* \\ F = 0( 7n ); (13) 

|| A — A* \\ F = 0{^ + 1 2 n ). (14) 

Note that Theorem 4 gives the optimal rate of convergence for / 2 -risk of the estimated eigenvectors 
when j3 n = (i.e., no model bias). This result follows from the minimax lower bound on the risk 
obtained by Paul and Johnstone (2007). Note that, this lower bound under the current setting 
follows essentially from the proof of Corollary 1. Also, when a < 1/2, this result shows that A 
converges at a parametric rate. Indeed, the representation (|12p implies asymptotic normality of A 
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when a < 1/2. In the derivation of Theorem 4 we need to compute the Hessian and its inverse, 
which leads to the following representation. 

Corollary 3: Under the assumptions of Theorem 4, we have the following representation: 

Hg 1 (6*;8*)(V BL n (9*)) = [Ri5B*i : ••• : H r SB^ r ], 
where B^j is the j-th column of B*, and 

R j = ^2 7\ — _ \ \ B * iB *i ~ ~\ i^M ~ B^B^), 

is the resolvent operator corresponding to "evaluated at" (1 + A*j). 

Combining Corollary 3 with (jlip . we get a first order approximation to B which coincides with the 
approximation for sample eigenvectors obtained in Paul and Johnstone (2007). However, Theorem 
4 has deeper implications. Since it gives an efficient score representation, it suggests an asymptotic 
optimality of the REML estimators in the minimax sense. 



4 Proof of Theorem 1 

Since a 2 is fixed and assumed known, without loss of generality, we take a 2 = 1. In this section, 
we give an outline of the main ideas/steps. The details of the proofs are given in Appendix B. The 
strategy of the proof is as follows. We restrict our attention to a subset @(a n ) of the parameter 
space (referred as the restricted parameter space), which is the image under exponential map of 
the boundary of an ellipsoid centered at 0, in the tangent space of an "optimal parameter". We 
then show that with probability tending towards 1, for every parameter value in this restricted 
parameter space, the value of the negative log-likelihood is greater than the value of the negative 
log-likelihood at the optimal parameter. Due to the Euclidean geometry of the tangent space, this 
implies that with probability tending towards 1, there is a local maximum of the log- likelihood 
within the image (under exponential map) of the closed ellipsoid. The key steps of the proof are: 

(i) Decompose the difference between the negative log-likelihood at the optimal parameter and an 
arbitrary parameter in the restricted space as a sum of three terms - a term representing the 
average Kullback-Leibler divergence between the distributions, a term representing random 
fluctuation in the log-likelihood, and a term representing the model bias (equation (|19H ) . 

(ii) For every fixed parameter in the restricted parameter space: (a) provide upper and lower 
bounds (dependent on a n ) for the average Kullback-Leibler divergence; (b) provide upper 
bounds for the random term and the model bias term. In both cases, the bounds are proba- 
bilistic with exponentially small tails. 

(iii) Use a covering argument combined with a union bound to extend the above probabilistic 
bounds on difference between log-likelihoods corresponding to a single parameter in 0(a n ) to 
the infimum of the difference over the entire 0(a n ). 

The strategy of this proof is standard. However, in order to carry it out we need to perform detailed 
computations involving the geometry of the parameter space such as the structure of the tangent 
space and the exponential map. Note that, in the current case the geometry of the parameter 



13 



space is well- understood, so that there exist explicit form of the exponential map and a precise 
description of the tangent space. This helps in obtaining the precise form of the local Euclidean 
approximations around an optimal parameter in the derivations. 



4.1 Parameter space and exponential map 

We use the following characterization of the tangent space Tb of the Stiefel manifold Sm r at a point 
B. Any element U G Tb can be expressed as U = BAjj + Cjj, where Ajj = —Ajj and B t Cjj = O. 
We then define the restricted parameter space centered at an optimal parameter (U*, A*) by 

8(on) := {(exp(l, B^Ajj + Cy), A* exp(D)) : Ay = —Aj}, BjCu = 0,D £ R r , 

such that || Ay ||| + || C v \\ 2 F + || D |||= a£}. (15) 

In the definition of @(a n ) an d henceforth, we shall treat A and D interchangeably as an r x 1 
vector, and an r x r diagonal matrix. The function exp(i, U) is the exponential map on Smt & t B*i 
mapping a tangent vector in Tb, to a point on the manifold. For U G Tb, and t > 0, it is defined 
as 



exp(t, U) = B*M(t, U) + QN(t, £7), where 



M(t, £/)' 
N(t,U) 



exp i 



R O 



where exp(-) is the usual matrix exponential, and QR = (iju — B^B^U is the QR-decomposition. 
The properties of the map exp(l, •) that we shall heavily use in the subsequent analysis (see 
Appendix B) are : for U G Tb, , 

^ T (exp(l,[/)-^) = BjU + 0((\\BlU\\ F + \\(I M -B*Bl)U\\ F )\\U\\ F ), (16) 

(I M - B,Bj)exp(h U) = (hi - B*Bj)U + O (|| (I M - B*Bj)U \\ f \\ U \\ f ) , (17) 

as || U \\f^ 0. These properties are easily verified by using the definition of the matrix exponential 
exp(-), and the Taylor series expansion. 

4.2 Loss decomposition 

We shall show that, given n > 0, for an appropriate choice of the constant co„ in the definition of 
a n (in Theorem 1), for large enough n, we have 



inf L n (B, A) > L n (B*,A+) > 1 - 0(n _?? ). (18) 

n ) J 

From this, it follows immediately that with probability tending towards 1, there is a local minimum 
(B,A) of L n (B,A) in the set @(a n ) defined as 

Q(a n ) = {(exp(l,B*A u + C u ),A*exp(D))-A u = -A u ,BTc u = 0,D£R r 
such that || A v ||| + || C v ||| + || D |||< o£}, 

which concludes the proof of Theorem 1. 
We start with the basic decomposition: 

L n (B, A) — £„(£?*, A*) 
= [EL n (B,A) -EL n (B„A.)] + [(L„(B, A) - EL n (B, A)) - (£„(£?*, A*) - EL„(S», A*))] 

= - E £*«) + -E tr ((^r 1 - - SO) + - E tr ((^ - - . (19) 

2 — 1 z— 1 i=l 
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where Si = YiY^ and 



Si s; 



-1/2 



ilog|I mi +S, ^(E^-S^S, 1/2 | 



is the Kullback-Leibler divergence corresponding to observation i. Note that the proofs of Theorems 
2 and 3 share a lot of commonality with the sparse case discussed here, in that these proofs depend 
on the same basic decomposition of the loss function. 

4.3 Probabilistic bounds for a fixed parameter in 0(a n ) 

In order to derive the results in the following three propositions, we need to restrict our attention to 
an appropriate subset of the space of the design points T which has high probability. Accordingly, 
given rj > 0, we define such a set through (|34p in Proposition 8 (in Appendix A). The following 
proposition gives probabilistic bounds for the average Kullback-Leibler divergence in terms of a n . 

Proposition 2: Given r\ > 0, for every (B, A) G Q(a n ), there is a set A^'J^ (depending on (B, A) ), 
defined as 



A 



B,A 
1,V 



d' v al < - ^ K(Y,i, £*j) < d'^m 2 a 2 n > , 
n i=l J 



(20) 



for appropriate positive constants d' v and d'^ (depending on Ai and r), such that for n large enough, 
F{A V n {A^Y) = 0(n^ 2+2 ^ Mr ^). 

Note that the bound in (|20p is not sharp when m — > oo, which leads the suboptimal rates in Corol- 
lary 2. The following propositions bound the random term and the bias term in (|19j) . respectively. 

B A 

Proposition 3: Given r\ > 0, for each (B,A) G @(a n ), there is a set A 2 ' , defined as, 

(- n 

i i=i 



A 



B,A 
2,r) 



< d^man 



M log n 



for some d v > 0, such that, W(A 1>V n (Af ' r , ) c ) = 0{n~^ 2+2 ^ Mr - r '). 



Proposition 4: Given rj > 0, for each (B,A) G Q(a n ), there is a set A^'J^, defined as, 



i=l 



/Mlogn 

< a v ma n 



}■ 



for some constant d v > 0, such that for large enough n, V(A V n (A^^) c ) = 0(n ( 2 + 2K ) Mr r ?). 

Combining Propositions 2-4, we obtain that, given r\ > 0, there is a constant co j?? , such that, for 
every (B, A) G 6(a n ), 



P({L„(B,A) - L n (B*,A*) < = 0(n-( 2+2K ) A/r ^). 



(21) 
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4.4 Covering of the space Q(a n ) 

To complete the proof of Theorem 1, we construct a 5 n -net in the set 0(a n ), for some S n > 
sufficiently small. This means that, for any (B\,Ai) G @(a n ) there exists an element (I?2,A2) 
of the net (with Bj, = exp(l, B*Arj k + Cjj k ) and A*. = A*exp(.Dfc), k = 1,2), such that we have 
II B x -B 2 HI + || Ai-A 2 |||< $1- The spaces {A G P/ Xr : ,4 = -A T } and {C G R Mxr : flJC = O} 
are Euclidean subspaces of dimension r(r — l)/2 and Mr — r 2 , respectively. Therefore, 0(a n ) is 
the image under (exp(l, •), exp(-)) of a hyper-ellipse of dimension p = Mr — r(r + l)/2. Thus, 
using standard construction of nets on spheres in M p , we can find such a 5 n -net C[5 n ], with at most 
dimaxjl, (a n 5~ 1 ) p } elements, for some d\ < oo. 

If we take 5 n = (m 2 n) _1 , then from (|2ip using union bound it follows that, for n large enough, 

P(( inf L n (B,A)-L n (B*,A*)>la 2 n ) n A v ) >l-0(n^). 
\\(B,A)eC[S n ] 2 "J 7 

This result, together with the following lemma and the fact that P(^4r?) > 1 — 0(n~' 7 ) (Proposition 
8), as well as the definition of C[5 re ], proves (|18p . The proof of Lemma 2 is given in Appendix B. 

Lemma 2: Let (Bk,Ak), k = 1,2, 6e any £u>o elements of ®{a n ) satisfying \\ B\ — B2 ||| + || Ai — 
A2 Wp< <5 2 , with 5 n = (ra 2 n) _1 . Then, given rj > 0, there are constants d^^, d^ ri > 0, such that, the 

maxi<j< n || Sj SiT, i — I m . ||^< d3 5 ^mlogn> satisfies P^^jT) > 1 — C^n"^" 1 ), 

/or T G A n ; and on A^ n , we have \L n {B\, Ai) — L n (I?2, A2) | = o(q 2 ). 



5 Proof of Theorem 3 

There is essentially only one step where the proof of Theorem 3 differs from that of Theorem 
1. It involves providing sharper bounds for the Kullback-Leibler divergence between an "optimal 
parameter", and an arbitrary parameter in the restricted parameter space @(a n ), an ellipsoid in 
the tangent space at the "optimal parameter" : 

8K) = {(exp(l, B*Au + Q/), A* exp(D)) : Ay = —Ajj, B^Cy = 0,D £R r 

O" 2 O" 2 

such that — || Ay ||p + II Cjj \\p H II D \\%>= a 2 ,,}. 

s s 

Note that now the restricted parameter space is the image (under exponential maps) of an ellipse, 
whose principal axes can differ substantially depending on the signal-to-noise ratio s/a 2 . This is 
crucial for obtaining the sharper bounds for the Kullback-Leibler divergence (see equation (|58H ). 
As in Section^ our strategy is to show that, given rj > 0, for an appropriate choice of co ;?? , for 
large enough n, we have 

P( inf L n {B,A) >Z n (£*,A*) ) > l-0(n-"). 

\(S,A)eG(a„) J 

From this, we conclude the proof of Theorem 3 using similar arguments as in the proof of Theorem 
1. 
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Define S = ± Ya=i Y i Y i \ where Y i = ® Y i- Tnen > for an arbitrary (B, A) G 6(a n ), we have the 
following decomposition: 



L n (B,A)-L n (B*,A*) = [(Z re (£,A)-EL n (B,A)J - (L n (B„ A,) - EL n (B„ A,) 

jf(r, r*) + itr (V 1 - r^)(5 - r)) + itr ((r- 1 - rr^r - r*)) 



(22) 
with 

#(r,r*) = ^(r-^r, -r))-iio g |i M + r- 1 (r,-r)| 

= itr (r-i/a(r, - r)r-V2) _ i log | Im + r -V2 (r# _ r)r -i/ 2 | 5 

being the Kullback-Leibler divergence between the probability distributions Nm(0, T) and Nm(0, T*), 
where r -1 / 2 = (r 1 / 2 ) -1 , and T 1 / 2 is a symmetric, positive definite, square root of T. The following 
is an analogue of Proposition 2. 

Proposition 5: Under the assumptions of Theorem 3, there exist constants c',c" > such that, 
for sufficiently large n, 

ca 2 n (±) < K(T, r*) < c"a 2 n , (23) 

for all (B,A) G @(a n ), where V = sBAB T + a 2 I M and T* = sB^A^B^ + a 2 I M ■ 
The following are analogues of the Propositions 3 and 4, respectively. 

Proposition 6: Given rj > 0, there exists a constant > 0, such that for each (B,A) G Q(a n ), 



tr((T-'-T^)(S-T)) <cj^^ x f^a n ) > 1 - (ri -(W^ 



n v a 



This proposition can be easily proved using an exponential inequality by Davidson and Szarek 
(2001) on the fluctuations of the extreme eigenvalues of a Wishart matrix. 

Proposition 7: There is a constant c > such that, uniformly over (B,A) G @(a n ), 

| «r ((r- 1 - r- 1 ) ^ - r.)) | < || r- 1 - r- 1 |M|r-r, \\ F < c^J^a n p n . 

Propositions 5-7 (together with conditions Al' and C") show that, for an appropriate choice of 
com, L n (B,A) — L n (B*,A*) > c'a^, for some d > with very high probability, for every fixed 
(B,A) G @(a n ). The proof of Theorem 3 is finished by constructing a 5 n -net similarly as in Section 
3] for the sparse case. 



6 Proof of Theorem 4 

The basic strategy of the proof is similar to that in classical inference with Euclidean parameter 
space. The main difficulty in present context lies in dealing with the Hessian operator of the log- 
likelihood (intrinsic Fisher information operator) and its inverse. Details of these calculations are 
given in Appendix D. 
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Rewrite the negative log-likelihood (fTUj) (up to a multiplicative constant) as 

1 n _ 

L n (S,A) = tr(r _1 5)+log|r|, where S = -^YiY?. (24) 



n 

8=1 



By Theorem 3, given 77 > 0, there is a constant C3 jr? > such that the set 

:={\\U \\ 2 F + || D |||< cs^a 2 } 



has probability at least 1 — 0(n v ), where a n = cq,t)\I ~ „ g ~ , and {U,D) G Tb« ®W is such that 

(B,A) := (exp(l, J7), A* exp(D)) is a minimizer of ([2"3]) . 

First, by the same concentration bound for singular values of random matrices with i.i.d. Gaus- 
sian entries (Davidson and Szarek, 2001) used in the proof of Proposition 6, there exists 

> 0, 

such that the set 



A 4, v ■= { || S - T || < c 4:V 



M V log n 



n 

has probability at least 1 — 0(n _,? ). It then follows that, we can choose an appropriate constant 

c 5ir) > such that, on A 3tV n A^, \\ VL n (0*) ||< c 5tV j n , where 7 n = max{ y / ^% £Ii i Pn } and 
0* = (B»,A*). Next, for any A = (A B , A c ) G T]j,©I r , define 

1/2 



A I := 



As ||f + || A^ ||p 



Also, let {-,-) g denote the canonical metric on Tb„ © M r (see Appendix D). Using the fact that 
VL n (9) = 0, where 9 = (B, A), and defining A := (U,D), then on A 3 ,r,r\A\ v , for any A G T B , ©R r 
with || A ||< 1, 

-(VL n (0*),X) g = (VL n (9)-VL n (9*),X) g 

= (H n (6.)(A), X) g + 0(|| A || 2 ) + 0( 7 „ || A ||) 

= (H{9*;9*){A),X) g + ([H n (9*) - H(9*;9*)](A),X) g + 0(a? + a n7 „)(25) 

where H n (-)(A) and ff(-;0*)(A) are the corresponding covariant derivatives of L n ( - ) and L(-;0*) 
in the direction of A. By simple calculations based on the expressions in Appendix D, there exists 
a constant cq :V > 0, such that on A 3iV n A^, || H n (9*){A) — H(9*;9*)(A) \\< CQ tV a n j n . It can be 
checked using assumptions Al' and C" that the linear operator H~ 1 (9^; 0*) : Tb_ ©M r — > 7s,, ©M r , 
is bounded in operator norm (Appendix D). Therefore, using the definition of covariant derivative 
and inverse of Hessian, from (|25p we have, on A 3 ^ n A^, 

A = -i?" 1 ^*; 0*)(VL n (0*)) + 0(a n7n ) + 0(a 2 ). (26) 

Hence, on A 3)T) n Ai^, the bound on || A || can be improved from 0(a n ) to 

\\A\\=0( ln a n + a 2 n ). (27) 

We can then repeat exactly the same argument, by using (|25f) to derive (I26p . but now with the 
bound on || A || given by ([27]) . Since M = 0(n a ) for some a < 1, so that a 2 = o(7„), this way we 
get the more precise expression, 

A = -H-\6*;6*)(VL n {9*)) + 0( 7 2 ). (28) 



18 



Moreover, it can be easily verified that, 



d 

7^:V ' BL n {6*) 



0. 



Hence, by (|70p in Appendix E, the Hessian operator, and its inverse, are "block diagonal", on the 
parameter space (viewed as a product manifold), with diagonal blocks corresponding to Hessians 
(inverse Hessians) w.r.t. B and £, respectively. This yields (jlip and (|12p in Theorem 4. Also, (|13p 
and (I14p follow immediately from (|28|) . 



7 Discussion 

In this paper, we have demonstrated the effectiveness of utilizing the geometry of the non-Euclidean 
parameter space in determining consistency and rates of convergence of the REML estimators of 
principal components. We first study the REML estimators of eigenvalues and eigenfunctions of 
the covariance kernel for functional data, estimated from sparse, irregular measurements. The 
convergence rate of the estimated eigenfunctions is shown to be near-optimal when the number of 
measurements per curve is bounded and when M, the number of basis functions, varies with n at an 
appropriate rate (Theorem 1 and Corollary 1). The technique used in proving Theorem 1 is most 
suitable for dealing with the very sparse case (i.e., number of measurements per curve is bounded). 
We have also used it to prove consistency for the case where the number of measurements increases 
slowly with sample size (Corollary 2). However, this does not result in the optimal convergence 
rate. The latter case is more difficult because of the complications of dealing with inverses of 
random matrices (Xj) of growing dimensions. A more delicate analysis, that can handle this issue 
more efficiently, is likely to give tighter bounds for the average Kullback-Leibler divergence than 
that obtained in Proposition 2. Then it may be possible to extend the current technique to prove 
optimality of the REML estimators in a broader regime. A variant of the technique used for proving 
Theorem 1 also gives consistency of the REML estimator for functional data in a regime of dense 
measurements, as well as for a class of high-dimensional Gaussian vectors (Theorems 2 and 3). 
In the latter case, we also derive an efficient score representation (Theorem 4), which involves 
determining the intrinsic Fisher information operator and its inverse. 

Now we present some conjectures we aim to pursue. First, as discussed earlier, based on 
the score representation, we conjecture the asymptotic optimality of the REML estimator for the 
matrix case. Secondly, we conjecture that there exists an efficient score representation of the REML 
estimator in the functional data problem as well. If so, then this estimator is likely to achieve the 
optimal nonparametric rate (for a broader regime), and may even be asymptotically optimal. This 
may explain the superior numerical performance of the REML estimator observed by Peng and 
Paul (2007). Thirdly, our results (Theorems 2 and 3) give a strong indication of an asymptotic 
equivalence between two classes of problems : statistical inference for functional data with dense 
measurements; and inference for high-dimensional i.i.d. Gaussian vectors. Finally, in this paper 
we have not addressed the issue of model selection. A procedure for selection of M and r, based 
on an approximate leave-one-curve-out cross-validation score, has been proposed and implemented 
in Peng and Paul (2007). This approximation is based on a second order Taylor expansion of the 
negative log-likelihood at the estimator and it involves the intrinsic Fisher information operator and 
its inverse. Therefore, based on the analysis presented here, it is conjectured that the approximate 
CV score thus defined is asymptotically consistent for the class of models considered in this paper. 
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Appendix A : Properties of cubic B-spline basis 

In many proofs of this paper, we need to use some properties of the cubic .B-spline basis. We 
state some of them. More details can be found in de Boor (1978) and deVore and Lorentz (1993). 
Let cj) = (</>i, . . . , <Pm) t be the (standard) cubic B-spline basis functions on [0,1] with equally 
spaced knots. Then, the orthonormalized spline functions (f>i, . . . , 4>m are defined through cf)(t) = 

G '4>(t) , where G^m '■= ((J* ( ^fc(i) < M*) c ^))fc^=i) is the Gram matrix of 0. It is known (cf. de 
Boor (1974), Burman (1985)) that G^^m is an M x M banded matrix, and satisfies, 



c <j>,o 



hi < GV,m < 



C0,1 



Im for some constants < c<^o < c^i < oo. 



(29) 



M ~ v ' ~ M 

From this, and other properties of cubic B-splines (deVore and Lorentz, 1993, Chapter 13), we also 
have the following: 

51 sup ig [ ,i] J2k=i 4>k{t) — c 4>,2^1 for some constant c<^2 > 0. 

52 For any function / G C( 4 )([0,1]), we have || / - P^uif) l|oo=|| / (4) IU 0(M" 4 ), where 
P&mU) = YlLiU, 4>k)4>k denotes the projection of / onto span{^i, . . . , <j) M } = span{(^i, . . . , ^ M }- 

Note that, property S2 and assumption A2 imply the existence of orthonormal functions {ip*k}k=i 
of the from 

(V*i(i), • • . ,<M*)) = W*(t)) T = Bjftt), B^B* = I r , 

which satisfy 

(30) 



max \\ip k - ?/>*fc 

Kk<r 



)< ca 3 M 4 max || ip k ^ ||oo 

Kk<r 



Using these properties we obtain the following approximation to the important quantity || $j ||, 
where $j = [4>{Tn) : ... : 4>{Ti rni )] and || • || denotes the operator norm. This result will be 
extremely useful in the subsequent analysis. 

Proposition 8: Given r] > 0, there is an event defined in terms of the design points T, with 
probability at least 1 — 0(n _T? ), such that on the set A^, 



$i || 2 < mc s ,i + v^cTXK^ 372 iog^) V {My/rfi log n)}, 



(31) 



for some constant d v > 0, and c 9j i is the constant in condition B2. Furthermore, for all T, we 
have the non-random bound 



< c^^mM, for all 



1, ... ,71. 



(32) 



Proof : First, (|32p follows from the bound SI, since 



II *f*< II 

m; M 

^ ^((pkiTij)) 2 < c^miM < c^fnM, for all i = 1, 

j=l k=l 



\\<\\ $i ||£=tr 



,n. 



In order to prove (|3ip . first write 



mi 



T 



1/2 



<b(t)m)Y g(t)dt = g~^ 



«7T JZlMT^MTv) - E(tej)toi))D)?i=i 



G 



-1/2 



0,M 

(33) 



20 



Next, observe that, E[^/ c (Tj 1 

that maxi< fe < M \\4>k\V 
by 



b^Ta)} 2 = f(4> k (t)) 2 (4>i(t)) 2 g(t)dt = for \k - l\ > 3; and is within 
1 ], for constants < 0^4 < c^s < oo, if \k — l\ < 3. Then, using the fact 
is bounded, it follows from Bernstein's inequality that the set A n defined 



-4, 



T : max max 

Ki<n KkA<M 



^ mi 

— J2[4> k (Tij)4>i(Tij) - nhiTi^UTij))] 

Tfli . 



, logn\ 
< di,„ ( -2- ] 



m 



'logn 
mM 



(34) 

has probability at least 1 — 0{n ri ), for some constant d\„ > 0. Now, we can bound the Frobenius 
norm of the matrix in (|33|) by using (|29p and (|34p. and the fact that the matrix has O(M) nonzero 
elements. Then using ([3]) we derive (131 p . 



Appendix B : Proofs for the sparse case 

Proof of Proposition 2 : The main challenge in the proof of Proposition 2 is to efficiently 
approximate the average Kullback-Leibler divergence. We can express K (Ej, S*j) as 

^ m 

Kpi, = - ^[Aj(-R*i) - log(l + \j(R*i))], (35) 

where \j(R*i) is the j-th largest eigenvalue of i?*, = (E*j — Ej)E^ . Using the inequality 

e x > 1 + x for i£l (so that each term in the summation in (|35p is nonnegative) , and the Taylor 
series expansion for log(l + x) for \x\ < 1, it can be shown that, given e > sufficiently small (but 
fixed), there exist constants < ci )6 < C2, e < oo such that for || R*i ||^< e, 

Cl, e || R*i ||f< ^"(Sj, £*j) < C2 j£ || i?« Hi' . (36) 

Next, observe that 



v -l/2/ v v n v — 1/2 II || v -l/2/ v v \ v ~l/2 

K^i ~ Zj *iJ Zj *i \\F ^ II r> II /II V^i ~~ ^"^w 

S II -K« F S 



i+ || £- 1/2 (s 4 - e»)£» 1/2 || F ■ " l- || Kl /2 (^ - s„)s« 1/2 \\ F ' 

whenever || E ■ (Ej — E^)E^ ||i?< 1. the proof of Proposition 1 can thus be reduced to finding 

probabilistic bounds for ^ Ya=i II ^i^i^i ~ Ill- 
One difficulty in obtaining those bounds is in handling the inverse of the matrices £*j. In order 
to address that, and some related issues, we use the properties of the cubic spline basis derived in 
Appendix A. In the following lemmas we confine ourselves to the restricted parameter space Q(a n ), 
i.e., (B,A) e 9(a n ). 

Lemma 3: Under the assumptions of Theorem 1 (for mi 's bounded), or Corollary 2 (for mi 's 
increasing slowly with n), 

(l+d 1 X 1 rm)~ 1 || Ej-E*j || F <|| £^ 1/2 (£j-£*j)£~. 1/2 || F <|| Ej-E*j \\ F , for alli = 1, . . . ,n, (37) 
for some constant d\ > 
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Proof : From condition A2 and (|30j) . it follows that, 3 D\ > such that, for all M, 

max II tb*u ||oo< Di < oo. (38) 

l<fc<r 

This, together with the definition of (-B*, A*), leads to the following bound on the eigenvalues of 
the matrices 

1 < A min (S*j) < A max (S*j) < 1 + DirniiX^i < 1 + diXirm, for alH = 1, . . . , n, (39) 
for some d\ > 0, from which (137p follows. 

Lemma 4: Under the assumptions of Theorem 1 (for irii 's bounded), or Corollary 2 (for 's 
increasing slowly with n ), given any r] > 0, on the event defined through |5^p in Proposition 8, 
which has probability at least 1 — 0(n~ v ), for sufficiently large n, 



max || II F 

Ki<n 



< 



,M 3/2 log 



n , 



V 



m 



M 2 log n 



m 



— 2 2 

m a n 



A [d 2 Im 2 ^] ,(40) 



where the second bound holds for all T. Here di,d2,d^ tV > are appropriate constants depending 
on r and \±. 

Proof : An upper bound for || Sj — S*j \\ F is obtained by expressing Im = B^B^ + (Im — B*B^), 
and then applying the triangle inequality, 

|| X, - HiHI $J(BAB T - B^Bj)^ \\ f 

< || <S>jB*(BjBAB T B* - A*)#f $; \\ F +2 || $jB*BjBAB T (I M - B*Bj)$i \\ F 
+ || $f (I M ~ B*Bl)BAB T {I M - B*Bj)$i \\ F 

< || || 2 || B?BAB T B* - A, \\ F +2 || |||| A |||| B T (I M - B*B?)<S> \\ F 
+ || A (HI <5>f(I M -B*Bj)B \\ 2 F 

< Dirm || BlBAB T B* - A* \\ F 

+y/d^rm\ 1 || $f (I M - B*Bj)B \\ F [l + (^rm)- 1 / 2 || $J(I M - B*B?)B , (41) 

for some c?4 > 1. For the second inequality we use || B^B ||< 1, and for the last inequality we use 
and (|39p . Next, by using (|32[) . (|17p and ([5]), we obtain the (nonrandom) bound 



m" 1 max || $f(I M - B*Bj)B \\ 2 f < c , 2 M || (I m - B*Bj)B \\ 2 f < c^Ma\(\ + o(l)) = o(l). (42) 

Kt<n 



Then the bound in (|41j) can be majorized by, 



Dirm || BjBAB T B* - A, || F + v / d 4 rmAi || || || (7 M - B*B^)B \\ f (1 + o(l)). 

Using (f3"T|) to bound || <&j ||, from ([IT]), and the definition of Q(a n ) together with (fl~6j) and (fT7|) . we 
obtain (|3Uj) . 

Lemma 5: Under the assumptions of Theorem 1 (for mi 's bounded), or Corollary 2 (for mi 's 
increasing slowly with n), for any given r] > 0, there is a positive sequence E\ jn = o(l) (depending 
on r\), and a constant d\^ > 0, such that 



1 n 

- y~] || Si - £ ri \\ F > di v m?o? {l - £i n ) 



i=l 



> 1 - 0(n 



-(2+2K)Mr-rn 
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where k is as in Bl. 

Proof : Let A = BAB T — B*A*Bj '. Observe that, by definition of 0(a n ) (equation ([15]) ). and 
equations (|16p and (|17p . for large n, 

|| A |||*< c*a 2 , (for some constant > 0). (43) 

First, consider the lower bound 

rrii 

|| HI = tr [$f A$i$f A<D 4 ] > £ [(^(%)) T A0(^- 2 )] 2 . (44) 

We are going to derive an exponential tail bound for - Y17=l Y^^j2[( < t , (Tij 1 )) T A<p(Tij s )] 2 . Rewrit- 
ing the term on the extreme right, and using the fact that {Tij}^ 1 are i.i.d. with density 
9 ■ Cg t o < g < c 9t i (B2), we have, for all i, 

rrii 

E £ [(0(!%)) T A0(:Zk a )] 2 = m*K-l)tr (E[0(Ti 1 )(0(ra)) T ]AE[0(T i 2)(0(Ti 2 )) T ]A) 

£ ( c g,oHi(l21 - 1) II A \\f, c 2 gl m(m- 1) || A 

€ (<m 2 a 2 (l+ (l)), <m 2 a 2 (l + o(l))) , (45) 

for some d'{ > d[ > (whose values depend on c g fl, c 9i i and the constants appearing in Al), where 
in the last step we use ([4"3]) . The last inequality uses ([72"]) . ([75]) . the definition of @(a n ), and the 
properties ([T6]) and ([T7]) . Notice that, the variance of ^™Sj 2 [(^(ri J - 1 )) r A0(Tij 2 )] 2 can be bounded, 
for sufficiently large n, as 



rrii 

maxVarl £ [(0(T yi )) T A0(T ija )i 2 



< max E || ^ - ||| ^ [(0(^ 1 )) T A0(^ 2 )] 5 

\ h¥=h 

< d' 2 M(m 2 a 2 n ) 2 =: V 1>n , 

where d' 2 > is some constant. In the above, we obtain the first inequality by (|44p . and the 
second inequality by using ([3D]) and ([35]). Next, Yl™Uj 2 [( ( l>( T m))' 1 'A0(2ij 2 )] s for i = 1, . . . , n, are 
independent, and bounded by i^i, n := ^Mm 2 ^, for a constant (f 4 > (using ([3D]) ). Hence, by 

' A/ log" 1 fT7—\ QT ,H fTF~ . /~MJ°El — 

n 

■2 „,2 1 



applying Bernstein's inequality, and noticing that Ki,ny — = °(\/Vi,n), and -^/TT 
o(m 2 a 2 ) (by ([5]), or (JBJ), the result follows. 



Lemma 6: Under the assumptions of Theorem 1 (for m,i 's bounded), or Corollary 2 (for rrii 's 
increasing slowly with n), for any given n > 0, there is a positive sequence e2, n = o(l) and a 
constant d 2t n > 0, such that 



i n 

E|| Si - S +i |||< d 2 ,r,m 2 a 2 l (l + e 2 . 



n 

i=l 



>l-0(n^ 2+2 ^ Mr -^), (46) 
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where K is as in Bl. 

Proof : From the proof of Lemma 4, especially the inequalities (|4ip and (|42p . it is clear that we only 
need to provide a sharp upper bound for ^ Y17=i II ^fi-^M — B*B^)B \\ 2 F . Let A := {Im — B*Bj)B . 
Then from (|17p . for n large enough, 

II A |||< c*a 2 n (47) 

for some c* > 0. Then, using ©, for all i, 

mi 



E || $i {Im - B*Bt )B ||| = > tr E(0(T i:/ ) W>(T 4J )r )AA 



< c 9i imitr [AA T ] < c Si im || A ||| . (48) 
Combining f|48j) with (|47p . (|4ip and (|42p . we get, for sufficiently large n, and some constant C > 0, 

1 ™ 

-VE (I Ei-S»i ||| < Cm 2 a 2 n 

Next, using (|42p . we have 



n 

8=1 



max Var(|| $f A |||) < max E || $f A ||| 

l<i<n l<j<n 

< c^mM || A ||| E || $f A ||| 

< c^^Cg^rffM || A ||| 

< C'Mm 2 a 4 n (l+e n ) =: F 2> „, (49) 

for some positive sequence e n = o(l) and some constant C > 0, where in the last step we used 
(f47l) . Again, using Bernstein's inequality for - Y17=i II ^"f A |||, which is a sum of independent 
variables bounded by K 2: n = c 9i iMma^(l + o(l)), the result follows (checking that, by ([5]) or ©, 

we have, K 2 , n \J = o(^/V 2 , n ) and ^V 2 ,n\[^^ = o{rn 2 a 2 n )). 

Proof of Proposition 3 : Write, Ri = E^E^ 1 - E^ 1 )^ 2 . We can bound || R { \\ F as 

II J? II ^ II v 1 / 2 ^- 1 / 2 INI im, v -l/2 v l/2 im, v -l/2/ v v Nv^ 1 / 2 !! 

|| .Kj || jr < || 2^ lj i mi ^ 2^ mi ^ 2^ IIH 2^ ~ \\F 

^ II v 1 / 2 ^- 1 / 2 ||2|| v -l/2 v l/2 ,|2|| v v I, 
S || 2,.,- || 2,- 2, • l^i - 2^ \\ F 



< (1+ || Ei - £» ||)(1- || Si - E» (I)" 1 || E 4 - E* 



i \\F, 



where the third inequality is due to (|39p . Note that, by condition C, it follows that maxi<j<n || 
Ej — E*j ||jr< C fn(3 n = o(l) for some constant C > 0. Therefore, applying (|40l) . we observe that for 
T € with as in (|34p . for large enough n, || i?j ||j?< 2 || Ej — £« \\ F . Due to the Gaussianity 
of the observations, for any symmetric rrii x mi matrix A, the random variable tr {A(Si — Ej)) has 
the same distribution as tr (Di{XiXf — I mi )), where Di is a diagonal matrix of the eigenvalues of 

\l 2 1/2 

Ej ^4£j , and Xi ~ N(0,I mz ) are independent. Therefore, using an exponential inequality for 
a weighted sum of independent x\ random variables, we have, for T £ A v and each (B, A) € 0(a n ), 

' n \ 1/2 



1 " 



> 1 - 0{n 



i=l 

(2+2re)Afr-m 



M log n 
n \ n 



< ^3,r]\j — | ~ / || Z-'i ~ Zj *i ||_F 

i=l 
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for a constant d% „ > 0. Therefore, using (|46p we conclude the proof. 



Proof of Proposition 4 : Using Cauchy-Schwarz inequality twice, we can bound the last term 
in (|19p . which corresponds to model bias, as 



1 n 



8=1 



1 

< - 

~ 2 



1 n 

A E 



8=1 



< — max 

2 l<j<ri 



J « ||F 



J *j \\F 



1/2 



1 ™ 

n ^ 



-1 1 1 2 



1 " 

-£ 

8=1 



8=1 



E~ 



1/2 



v -l i|2 

^♦i IIf 



1/2 



< \m[i n max || E ■ 1 



1 n 

-£ 



J *8 || F 



8=1 



1/2 



where in the last step we used ([71 p . Thus, the proof is finished by using condition C, f|39|) and (|46p . 
Now, to the complete the proof of Theorem 1, we give the details of the covering argument. 

Proof of Lemma 2 : Using an expansion analogous to f) 191) and the upper bound in (|36h . and 
applying Cauchy-Schwarz inequality, we have, for some constants C\,Ci > 0, on A^ rj and for 
T S A,,, for n large enough, 



\L n (B u A 1 )-L n (B 2 ,A 7 



< 



i n 

EN 

8 = 1 



J 2,8 llF| 



5; 



J 8 f 



J l,i 



^2,i IIf + II ^i i ~~ El 2 i || || Sj — S2,8 ||f 



< max || Ei 

Ki<n 



J 2,8 



-1 

J l,8 



II V II II vt-V2 c ^— 1/2 

Zjj max 2j,- OjZ^j 

l<8<n 

\p ( max || Ej — E*j ||^ + max 



\f +C 2 max || Eii 

l<8<n. 



J 2,i Hf 



J 2,8 



1<8<88 ' ' 1<8<71 l<j<n 



*8 Fj 



< d^fm 5 n mlog?i + m 5 n + m5 n (mf3 n + v Mma B )| = o(a n ). 

In the last step, we have used Lemma 4 (for the last term), the identity ([71]) in Appendix F, and 
the fact that || E~j ||< 1 (k = 1,2). 

Proof of Corollary 1: The best rate follows by direct calculation. 

The near-optimality of the estimator requires proving that, for an appropriately chosen subclass 
C of covariance kernels of rank r, we have the following analog of Theorem 2 of Hall et al. (2006): 
for any estimator {ipk}k = i of the eigenfunctions {^kYk=i-> f° r n sufficiently large, 



mm sup 



Kk<r 



E||&-V* \\ 2 2 >Cn- 8 / 9 , 



(50) 



for some C > 0. Here the parameter space C consists of covariance kernels of rank r with eigenfunc- 
tions satisfying A1-A2. Moreover, the random design satisfies B1-B2, with m bounded above. 

The derivation of the lower bound on the risk involves construction of a finite, "least favorable" 
parameter set in C by combining the constructions in Paul and Johnstone (2007) (for obtaining 
lower bounds on risk in high-dimensional PCA) and Hall et al. (2006) (for functional data case). 

be a set of orthonormal functions on [0, 1] which 



This construction is as follows. Let <ft[, 
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are four times continuously differ entiable, with fourth derivative bounded. Let M* X n 1 / 9 be an 
integer appropriately chosen. Let 71, ... , 7m* be a set of basis functions that are (i) orthonormal on 
[0, 1], and orthogonal to the set {(j)®, . . . , (ii) are four times continuously differentiable and 7^ 
is supported on an interval of length 0(M~ 1 ) around the point One particular choice for these 
functions is to let {(j)®} be the translated periodized scaling functions of a wavelet basis at a certain 
scale with adequate degree of smoothness, and to let be the set of compactly supported, 

orthonormal, periodized wavelet functions corresponding to the scaling functions. Indeed, then we 
can choose M* to be an integer power of 2. Note that, such a basis ({<^ : 1 < k < r} U {72 : 1 < I < 
M*}) has the stability and smoothness property commensurate with the orthonormalized B-spline 
basis we are using for deriving the REML estimators. Next, let Ai > • • • > A r > be fixed numbers 
satisfying Al. Finally, let us define a covariance kernel as 

r 

X { °\s,t) = ^\ k 4(s)4(t), s,t€ [0,1]. (51) 
fc=i 

Also, for each fixed j in some index set Tq (to be specified below), define 

[^\s):---:^\s)\=*{s)B {J \ s £ [0, 1] 

where ^(s) = [4>\{s), . . . , 4>®{s), 7i(s), ■ ■ ■ , 7a/* ( s )) and B^ is an (M* + r) x r matrix with orthonor- 
mal columns (to be specified below). Then define 



4 J W) =E A fctf ) ( s )tf ) (*)' s ,t £ [0,1], (52) 
k=l 



for j £ F . We require that log|J" | ~ M* ~ ™ 1/9 , and || £ W - B b>) ||f,x n~ 8 / 9 , for j / j' and 
j, j' G JT U {0}. Here -B^ is the (M* + r) x r matrix of basis coefficients of Sq°^ with columns e&, 
the fc-th canonical basis vector in ~R M * +r . 

The proof of the minimax lower bound is based on an application of Fano 's Lemma (Yang and 
Barron, 1999), which requires computation of the Kullback-Leibler divergence between two specific 

values of the parameters. In order to apply Fano's lemma, we need to choose !Fq and B ,j& !Fq, 
such that 

ave^„Et,^',E|°»)] + lo g 2 oe 
log I Jo I 

where £j denotes the covariance of the observation i given {^i; }/=*i under the model parameterized 
— (j) 

by S , and E denotes expectation with respect to the design points T. Under the assumptions 
on the design points, using the properties of the basis functions {4>\Y=\ and {lk]k=v an d the 
computations carried out in the proof of Proposition 2 (in Appendix B), in particular a nonrandom 
bound analogous to the second bound appearing in Lemma 4, it is easy to see that for n large 
enough (so that || B^ — B^ \\p is sufficiently small), we have 



_. n 1 n 

n n 

i=l i=l 
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From this, and the property of the basis used to represent the eigenfunctions, it follows that 

i n 



n 



VEi^pfUf') x|| B W -W } ||| . (54) 



i=l 



The task remains to construct J-q and i? appropriately so that Co := {S : j £ .Fo} is in C, for 
n sufficiently large. 

Following the proof of Theorem 2 in Paul and Johnstone (2007), we first define Mo = [^r]- 
Then define the fe-th column of B as 



B ( k j) = Jl-5 2 k e k + 5 k ^ *ki e r+h k = 1, . . . , r, (55) 



i=i 

where zj!) are appropriately chosen using a "sphere packing" argument (to ensure that log|.Fo| x 

2/2 1/2 

M*), and take values in {— M , 0, M }. Moreover, let Sk be the set of coordinates I such 
that Zui 7^ for some j 6 J-q- By construction, are disjoint for different = 1, . . . , r, and 
~ M*/r. Hence, 



W = \/l-%<l>k + Sk 1 Z,4 'H, k = l,...,r. (56) 

Furthermore, by the construction of {z^}, Y2ies k \ z ki\ 2 = ^> an< ^ ^ or an y the vec t° rs 

z k ] = ( z kl)leS k and ) = {z[ j P) leSk satisfy || - zf ) || 2 > 1. Therefore, from {55]) it follows 
that the RHS of (|54|) is of the order 5?, and hence in order that (|53|) is satisfied, we need to choose 

Sk ~ n~ 4//9 x M~ 4 . It follows immediately from ([56]) that (i) the eigenfunctions ijyp , . . . ,tpr ar e 
orthonormal, and four times continuously differentiable. Also, since 7/ is centered around l/M* 
with a support of the order 0(M~ 1 ), it follows that, for only finitely many I ^ l', the support 

(s) 

of 7/' overlaps with the support of 7;. Moreover, if 7^ denotes the s-th derivative of 7;, then 
II 7; l|oo= 0{Ml^ 2+s ), for s = 0, 1, . . . , 4. Thus, the choice <5fc x M~ 4 ensures that, (ii) for each 

k = 1, . . . ,r, the fourth derivative of is bounded. Hence, by appropriate choice of the constants, 
we have that Cq C C. Finally, arguing as in Paul and Johnstone (2007), with an application of the 
Fano 's Lemma we conclude (1501) . 



Appendix C : Proof of Proposition 5 

Using standard arguments, it can be shown that, given e > sufficiently small (but fixed), we have 
constants < c\ e < C2. e < °o such that for || r* 1//2 (r* — r)r* ||^< e, 

ci, e || r; 1/2 (r* - r)r; 1/2 f F < k(t,t,) < c 2 , e || r; 1/2 (r, - r)r; 1/2 \\ 2 F . (57) 

]/2 1 /2 

Thus, it suffices to provide tight bounds for || (T — r*)r* \\p. We introduce some notations 

first. Define, G = ^A" 1 + I r , G* = ^A^ 1 + I r and A = BAB T - £*A*£j . Then, 

T- 1 = -1(J M - B(— A" 1 + 1,)" 1 ^) = -i(/ M - BG- l B T ), and T; 1 = \{I M - B.G^Bj). 
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Moreover, due to Al', there exist constants, 03,04 > 0, such that, 

C3 (~~) - a rnin{Ir ~ G' 1 ) < a max {I r ~ G" 1 ) < C 4 (~^J • 

We express Im — B*G~ 1 B]: as (Im — B*B^) + B*(I r — G~ 1 )Bj. Then we can express 

(a 2 / s f || r; 1/2 (r - r*)r; 1/2 1^= o- 4 1| r; 1/2 Ar; 1/2 ||f, as 

tr [(I M - B*G; 1 J Bj)A(J M - B.G-^JjA] 
= tr [(Im ~ B*Bj)BAB T (I M - B*Bj)BAB T ] + 2tr [B,(J r - G^)Bj BAB t (I m - B*Bj)BAB T ] 

+tr [B*(I r - G-^BjAB^Ir - G~ 1 )BjA] 
= II {Im — B*Bl)BAB T (Im — B*Bj) \\ f 

+2tr [(/, - GT 1 ) 1 / 2 B^BA(B T (I M - B*Bj)B)AB T B*(I r - G^ 1 ) 1 / 2 } 

+ || (i r - g;^ 2 (bIbab t b* - A,)(/ r - g;Y /2 f F 

2 / 2 \ ^ 

> 2c 3 A 2 ( C 7 min ( J Bj J B)) 2 — || (Im - B*B*)B ||| +c| I — || bJbAB t B* - A, ||| 



.s 



> c 4 || (/ A/ - \\ 2 F +4 f^pj || BjBAB T B, - A, (I 2 , (58) 

for constants C3, C4 > 0. Now, since (B, A) 6 Q(a n ), where B = exp(l, B^Ajj + Cjj) it follows that 
II Ajj ||j?< a ny J~^ and || C[/ ||j?< a n . Moreover, from (fT6j) . and using the fact that Ajj = —Ay, we 
have, 

Bj BAB T B* - A, = D + (Aj/A - A^) + 0(|| ^ ||| + || D ||| + || £/ || F (|| ^ \\ F + || Q; || F )). 

Since Ay A — AAu is symmetric, has zeros on the diagonal, and its Frobenius norm is bounded 
below by mini<j < fc< T .(Aj — A&) || Ajj \\ F , and D is diagonal, it follows that for some constant cq > 0, 

|| BjBAB T B* - A, |||> c 6 (|| D ||| + || A a |||) - 0((-^fl 2 c? n ). 



From this, and using (|17p to approximate the first term in (|58p . it follows that for some constant 
c 7 > 0, 

|| (/m - B*G^BJ)^A(I M - B*G-tB?)V* ||| 



, a 2 
> c 7 ( T 



2 2 

Cj/ II 2 , +— || Au Hi +— || D \\ 2 F -0(J^a 6 n ) 
s s V <y 



= c 7 (^pj a 2 (l - o(l)). (59) 
The last equality is because a n y^7 = o(l). Also, it is easy to show now that, for some cs > 

|| (I M - B^B^^hi - B*G- l B?) 1 / 2 f F < c s « 2 (1 + o(l)). (60) 

Hence, from (|59p and (|60p . it follows that, there are constants eg, cio > such that, for sufficiently 
large n, 

cga^y^ <|| r^r 1/2 (r - r*)r* 1/2 \\ F < c 10 a n J^, 

which, together with ([57j) . proves ([23]) . 
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Appendix D : Score representation in the matrix case 

First, define the canonical metric on the tangent space T B (BW of the parameter space (l = SM,r®^ r 
(for0 =(5,0) by 

(X, Y) g = (X B , Y b ) c + (X Y c ), for X B , Y B G T B , X ? , Y c G R r , 

where (X B ,Y B ) C = tr (X b (Im — \BB T )Y B ) is the canonical metric on Sm,t an d (-^C'^c) = 
tr (Xjy^) is the usual Euclidean metric. Next, for an arbitrary 9, write L n (9) = F^(9) + F 2 (9), 

where F^{9) = tr(r -1 S) and F 2 (9) = log |T| = log |7 r + e^\ = log | I r + A|. Similarly, we write 
L(9;9*) = F l (9;9*) + F 2 (9;9*), where F l (9;9*) = tr^r*) and F 2 (9;9*) = F 2 (9). Below, we 
shall only give expressions for gradient and Hessian of F^(-) and F 2 (-), since the gradient and 
Hessian of F 1 ^;^) and F 2 (-;#*) follow from these (by replacing S with r*). 

Gradient and Hessian 

From Appendices B and D of Peng and Paul (2007), we obtain expressions for the gradient and 
Hessian of F^(-) and F 2 {-). We mainly follow the notations used there. Let, P := P{9) = Im + 
BAB T . Then P" 1 = I M - BQ- 1 B T , where 

Q ■= Q(9) = A" 1 + B T B = AT 1 + I r Q^ 1 = A(I r + A) -1 . 

The fact that Q is independent of B is of importance in the calculations throughout. Use F^ B (-) 

to denote the Euclidean gradient of F^(-) w.r.t. B. It is easy to see that F^ B {9) = —2SBQ~ 1 . 
Then, under the canonical metric the intrinsic gradient is given by 

V fl i£(0) = Fl B (9) - B(Fl B (9)) T B = 2[BQ- 1 B T SB - SBQ^ 1 ]. 

Since F 2 {9) does not involve B, the Euclidean gradient F 2 B {9) = 0, and hence V B F 2 (9) = 0. 
Therefore, 

V B L n {9) = V B F^{9) = 2[BQ~ 1 B T SB - SBQ' 1 }. (61) 

Next, for X B G T B , let G\ BB (-)(X B ) be the Euclidean Hessian operator of F^(-) evaluated at X B . 
It is computed as 

G njBB (9)(X B ) = —2SX B Q~ . 

The Hessian operator of L n (-) w.r.t. B, equals the Hessian operator of F^(-) w.r.t. B. For 
X B , Y B G T B , it is given by 

H n>B (9)(X B ,Y B ) = tr (Y^Gl BB (9)(X B ))+hr [({F^ B (0)) T X B B T + B T X B {F^ B (d)) T ) Y B ] 

-itr [{B T Fl B (9) + (Fl B (9)) T B) X T B {1 M - BB T )Y B ] . (62) 

For computing gradient and Hessian with respect to £, we only need to compute first and second 
derivatives of the function L n {-) (equivalently, of F^(-) and F 2 {-)). Using calculations carried out 
in Appendix D of Peng and Paul (2007), and the identity P~ l B k = (1 + X k )~ 1 B k where B k is the 
fc-th column of B, 1 < k < r, we have, 

dF l i ) = - e C kB r p -isp-i Bk = --^-,BlSB k , and ^-(9) = e^B T k P^B k 



d( k w k (l + X k ) 2 " d( k y ' k 1 + A fc ' 

(63) 
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Thus 



V ( L n {6) = diag 



;(1 + Xk-Bl 



SB k )\ . 
J k=l 



(1 + A fc ) 2 

Since &£p- x B l = 0, for 1 < k ^ I < r, it follows that g£§i (0) = for fc / Z, i = 1, 2. Also, 



T {9) = e^BlP~ l SP- l B k 2e^{Blp- l B k ) - 1 



(9) = e^BlP~ l B k \l - e^{BlP^B k ) 



Afc(A fc - 1) pr g; p 
Afc 



Thus, the Hessian operator of L n (-) w.r.t. C, is given by 



A 



H n>< {9) = diag 1 (1 + A )3 ((A fc - 5B fe + (1 + A 



fe=i 



(64) 



Boundedness and inversion of H(9*;9*) 



As discussed in Section [6l the Hessian operator H(9*;9*) is "block diagonal". So, we only need to 
show the boundedness and calculate the inverse of B B {9*;9*) and H^(9^;9^). First note that, from 
(f64"|) we have, 

# c (0*;0*) = diag ( ^ ((K k - l)B^ k T,B, k + (1 + A**))) = A 2 (I r + A,)" 2 , 

V U + ^*k) J k= \ 

which is clearly positive definite with eigenvalues bounded away from and oo, due to conditions 
Al' and C". 

Next, we show that H B (0*;6 m )(X,X) > C(X,X) C , for some C > 0, for all X G T Bt . Define 
F B {9*;9*) = E fl .F,} B (0*) and G BB (9*;9*) = E et G l nBB (9*). Note that H B (0*;0*) is obtained by 
replacing F x B {9*) and G\ BB (9*) by F B (9*) and G BB (9*;9*), respectively, in ([62]) . Observe that, 

F B {9*) = -2T*B*Q~ l = -2B,{I r + A,)Q~ l = -2B*A*, 

where Q* = Q(9*) = A^r 1 (/ r + A*), and we have used the fact that T*B* = B*(I r + A*). For 
notational simplicity we use F B to denote F B (0*). Note that, for X G T B „ X = B*A X + (I - 
B*BT)Cx, where Ax = —A x G W xr , and Cx £ M A/XT '. Using this representation, for any 
X, Y G T Bt , we have 



and 



tr [((FhfXB? + B?X(Fh) T )Y 



tr [A*BjXB;Y + BjXA^Y 



tr [A*X T B*B*Y + B^XA*Y T B*] = 2tr [A^B^f Y] , 



(65) 



"^tr [(B2T (F i ) + iftfB.) X T (I M - B.BJ) Y] 

tr [(5jB»A» + A*BjB*)X T {I M - B*Bj)Y] = 2tr [A*A T (/ A/ - B*Sj)y] . (66) 
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Next, notice that, for X £ T Bm , G BB (6*; 9*)(X) = -ZT^XQ- 1 . Therefore, 

tr [Y T G BB (0*-0*)(X)] =-2tr [y^XQ^ 1 ] 

= -2tr [y r (/ M - B.B^XQZ 1 ] - 2tr [y T £?*(I r + A^XQ" 1 ] . (67) 

Now, combining (|65|) . (|66|) and (|67|) . and using the definition of Hb(9*; #*), and the facts that 
Ir — Q* 1 = (Ir + A.*)" 1 , X T B* = A x = —Ax and B^Y = Ay = —Ay, after some simple algebra 
we have, 

H B (9*;9*)(X,Y) 

= 2 (tr [X T B,KB^Y{I T + A,)- 1 ] - tr [X t B*bJyQ~ 1 ] ) + 2tr [A 2 (/ r + A*)" 1 ^!,. - B*B?)Y] 

(68) 

Again, since A^ = — A x and Ay = —Ay, denoting by Ax,ij and Ay,ij, the (i, j)-th element of Ax 
and Ay respectively, we have 

tr [X T B*A*BT Y(I + A,)" 1 ] - tr [X T B^BjYQ^} 
= -tr [A x (A*A Y (I r + A*) - AyA*(/ r + A*)" 1 )] 



i=1 j=l 

= E E •'v'.-u,, ) = E E A ** A ™ [( A « - A *i) (rr>- - rh~) 

i=l j=l «=1 .7=1+1 J 

i ~ 1 r (A • - A ) 2 1 r r (A • - A ) 2 

= E E A x,iJ A Y,ij (1 + x^)(i + a ■) = 2^^^^(l + r„)(l + A^)- (69) 

1=1 J=J+1 !=1 J=l J 

Since min 1 < fc ^ fc /< r (A*fc — A*fc/) 2 (1 + A*fc)~ 2 > Ci*, and A* r > C*2, for some constants Ci*,C*2 > 
(value depending on ci and C2 appearing in Al'), it follows from (f68l) and (|69|) that for X £ Tb*, 

ff B (0*; > C*itr (X T B*Blx) + 2C* 2 tr (X T (/ A/ - B*B?)X) 

> a 3 tr (X T X) , 

where C*3 = min{C*i, 2C*2}- This proves that Hb(9*] 9*)(X, X) is bounded below in the Euclidean 
norm and hence in the canonical metric because of the norm equivalence. An upper bound follows 
similarly. 

Proof of Corollary 3 : From (168j) and (169p . we can derive an explicit expression of H B 1 (9^;9^). 
Note that H B 1 (9^;9^)(X) is defined as 

H B {9,;9,)(H B 1 (9,;9,)(X),Y) = {X,Y) C , for any Y £ T B ,. 

Therefore, for X £ Tb* and Ax = Bj^X, 

H B \6M{X) = l B . ^ ft+M^+^ A^)) + " B*Bj)XA~ 2 (I r + A*). 
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Using this, we can now get an explicit expression for H B (9*; #*)(VsL n (#*)). From (fUTJ) . we have 



BjV B L n (6*) = 2 



— 911 (Ari - A*j) T ~ 
" 2 VV(1 + A„)(1 + A^^^ 

Also, 

(Jm — -B*-&T)V BL n (B*, A*) = —2(Im - B^B^)SB^Q~ l . 

Thus, it follows that 

-H B \9,- 9,) (v B Ln(0*)) = " (( (A - -A ■) B *^ B *^j) + (/m " B * B *)S B *K l 



RiSi?*! : • • • : ~R r SB- 



Appendix E : Gradient and Hessian on product manifolds 

In this section, we give a brief outline of the intrinsic geometry associated with the product manifold 
of two Riemannian manifolds, and as an application we consider the manifold S^r ® which is 
the parameter space for (B,() in our problem. 

Product Manifolds 

Consider two Riemannian manifolds: Ai,J\f with metrics gM and gw, respectively. The product 
manifold V of M, M is then defined as: 

V ■= M&N = {(x, y) : x £ M,y £ M} 

with the tangent space at a point p = (x, y) £ V, 

T P T := T X M © T y M 

where T x M,T y M are tangent spaces of A4,M at points x,y, respectively. The Riemannian metric 
g on the tangent space TV is naturally defined as 

(Ti,T 2 ) g : = (6,6)<?m + (VUV2) 9N , 

where T { = (&,77i) £ TV, with &eTM and m £ TN (i = 1, 2). 

By the above definition of the product manifold V, the intrinsic gradient and Hessian of a 
smooth function / defined on V are as follows: 

• Gradient: 

where (fj\f) is / viewed as a function on M. {M); and V» (Vjv ) denotes the gradient 
operator for functions defined on A4 (AA). 
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• Hessian: for T t = £ TV (i = 1,2), 

H f {T u T 2 ) = H fM (H 1 ,d 2 ) + (V M (V M f M ,Z 1 ) 

+ (Vx(V^/at,??i)^,6)sm + H fu(vi,V2)- 

The above expression is derived from the bi-linearity of the Hessian operator and its definition. 
Also note that 

{^u{^MfM,il)g hn V2)g N = (V 'm(^ 'N Jn ,m) g N > £0 'g M - 

Application to the product of a Stiefel manifold and an Euclidean space 

Consider the special case: A4 = Sm,t with the canonical metric (•, -) c , and Af = M, d with Euclidean 
metric. For a point p = (B, x) on the product manifold V, the tangent space is 

T p V = T B M © T X N, 

where 

T B M = {A G M Mxr : B T A = -A T B}, and T X N = M. d . 
For a smooth function / defined on the product space V: 

• Gradient (at p): 

v/k=(v«/.f 

where V M f |, = f B - B/JB (with f B = £J). 

• Hessian operator (at p): for T = (A, a), and forX = (X B ,rj) G T P V, 

8 d d 2 f 

H f (T,X)\ p = H fM (A,X B ) + (—(V M f,&)cV) + (-^(V M f,X B ) c ,a)+a T ^ri, (70) 

where 

H fM (A,X B ) \ p = f BB (A,X B )+^Tr [{flAB T + B T Afl)X B ]- l -Tr [(B T f B + f]^B)A T nX B ] 
with n = I - BB T . 

• Inverse of Hessian operator (at p): for G G T p V, T = Hj 1 (G)\ p is defined as: T = (A, a) G 
TpV such that for any X = (X B ,r)) G T P V the following equation is satisfied 

H f (T,X)\ p = (G,X) g . 
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Appendix F : Some inequalities involving matrices 

In this paper we make frequent use of the following matrix inequalities: 

• For any A, B, 

II AB ||f<|| A \\f\\ B II, and || AB ||f>|| A \\p X m i n (B), (for B positive definite) 
where X m i n (B) is the smallest eigenvalue of B. Also, if A and B are invertible then 

A" 1 - B" 1 = A~ X {B - A)B~ l = B~ 1 {B - A)A~ l . (71) 

• Weilandt's inequality (Horn and Johnson (1994)): For symmetric p x p matrices A, B with 
eigenvalue sequences Xi(A) > ■ ■ ■ > A p (^4) and X±(B) > ■ ■ ■ > X P (B), respectively, 

v 

]T 1^,4) -A, (5) | 2 <\\A-B ||| (72) 
i=i 

• Eigenvector perturbation (Paul (2005)): Let A be a p x p positive semidefinite matrix, with 
j-th largest eigenvalue Xj(A) with corresponding eigenvector pj, and Tj := max{(Aj_i(A) — 
Aj(^))" 1 , (Xj(A) - Xj+iiA))" 1 } is bounded (we take A (-4) = oo and X p+ i(A) = 0). Let B 
be a symmetric matrix. If denotes the eigenvector of A + B corresponding to the j-th 
largest eigenvalue (which is of multiplicity 1, for || B \\ small enough, by (|72p ). then (assuming 
without loss of generality qjpj > 0), 

II qi-Pill< 5^ + 4f^i) . (73) 
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